Assignment instructions:

Assignment submission (Marina Kochuten): ______________________________________


# Load libraries
library(tidyverse)
library(here)
library(janitor)
library(estimatr)  
library(performance)
library(jtools)
library(gt)
library(gtsummary)
library(MASS) ## NOTE: The `select()` function is masked. Use: `dplyr::select()` ##
library(interactions) 

DATA SOURCE:

Reed D. 2019. SBC LTER: Reef: Abundance, size and fishing effort for California Spiny Lobster (Panulirus interruptus), ongoing since 2012. Environmental Data Initiative. https://doi.org/10.6073/pasta/a593a675d644fdefb736750b291579a0. Dataset accessed 11/17/2019.


Introduction

You’re about to dive into some deep data collected from five reef sites in Santa Barbara County, all about the abundance of California spiny lobsters! 🦞 Data was gathered by divers annually from 2012 to 2018 across Naples, Mohawk, Isla Vista, Carpinteria, and Arroyo Quemado reefs.

Why lobsters? Well, this sample provides an opportunity to evaluate the impact of Marine Protected Areas (MPAs) established on January 1, 2012 (Reed, 2019). Of these five reefs, Naples, and Isla Vista are MPAs, while the other three are not protected (non-MPAs). Comparing lobster health between these protected and non-protected areas gives us the chance to study how commercial and recreational fishing might impact these ecosystems.

We will consider the MPA sites the treatment group and use regression methods to explore whether protecting these reefs really makes a difference compared to non-MPA sites (our control group). In this assignment, we’ll think deeply about which causal inference assumptions hold up under the research design and identify where they fall short.

Let’s break it down step by step and see what the data reveals! 📊


Step 1: Anticipating potential sources of selection bias

a. Do the control sites (Arroyo Quemado, Carpenteria, and Mohawk) provide a strong counterfactual for our treatment sites (Naples, Isla Vista)? Write a paragraph making a case for why this comparison is centris paribus or whether selection bias is likely (be specific!).


Step 2: Read & wrangle data

a. Read in the raw data. Name the data.frame (df) rawdata

b. Use the function clean_names() from the janitor package

# HINT: check for coding of missing values (`na = "-99999"`)

# read in the data, convert "-99999" to NA, and clean col names
rawdata <- read_csv(here("data", "spiny_abundance_sb_18.csv"), na = "-99999") |>
    clean_names()

c. Create a new df named tidyata. Using the variable site (reef location) create a new variable reef as a factor and add the following labels in the order listed (i.e., re-order the levels):

"Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista",  "Naples"
# Add long lables to our sites and save in a col named reef
tidydata <- rawdata |>
    mutate(reef = factor(site, 
                         levels = c("AQUE", "CARP", "MOHK", "IVEE", "NAPL"), 
                         labels = c("Arroyo Quemado", "Carpenteria", "Mohawk", 
                                    "Isla Vista",  "Naples")))

Create new df named spiny_counts

d. Create a new variable counts to allow for an analysis of lobster counts where the unit-level of observation is the total number of observed lobsters per site, year and transect.

e. Create a new variable mpa with levels MPA and non_MPA. For our regression analysis create a numerical variable treat where MPA sites are coded 1 and non_MPA sites are coded 0

#HINT(d): Use `group_by()` & `summarize()` to provide the total number of lobsters observed at each site-year-transect row-observation. 

#HINT(e): Use `case_when()` to create the 3 new variable columns

# assign each site either mpa or non-mpa, and 1 or 0
spiny_counts <- tidydata |>
    group_by(site, year, transect) |>
    summarise(count = sum(count, na.rm = TRUE), mean_size = mean(size_mm, na.rm = TRUE)) |>
    mutate(mpa = case_when(site %in% c("IVEE", "NAPL") ~ "MPA",
                           .default = "non_MPA")) |>
    mutate(treat = case_when(mpa == "MPA" ~ 1,
                             .default = 0)) |>
    ungroup()

NOTE: This step is crucial to the analysis. Check with a friend or come to TA/instructor office hours to make sure the counts are coded correctly!


Step 3: Explore & visualize data

a. Take a look at the data! Get familiar with the data in each df format (tidydata, spiny_counts)

b. We will focus on the variables count, year, site, and treat(mpa) to model lobster abundance. Create the following 4 plots using a different method each time from the 6 options provided. Add a layer (geom) to each of the plots including informative descriptive statistics (you choose; e.g., mean, median, SD, quartiles, range). Make sure each plot dimension is clearly labeled (e.g., axes, groups).

Create plots displaying the distribution of lobster counts:

  1. grouped by reef site
  2. grouped by MPA status
  3. grouped by year

Create a plot of lobster size :

  1. You choose the grouping variable(s)!
# plot 1: Ridge plot grouped by reef site
library(ggridges)

spiny_counts %>% 
    ggplot(aes(x = count, y = site)) +
    geom_density_ridges2(quantile_lines = TRUE,
                         quantiles = 2,
                         alpha = 0.3,
                         fill = "coral") +
    xlab("California Spiny Lobster Count") +
    ylab("Reef Site") +
    labs(title = "Distribution of California Spiny Lobster Counts by Reef Site & Median Counts") +
    theme_minimal()

# Plot 2: Jitter plot of counts grouped by MPA status
spiny_counts |>
    ggplot(aes(y = count, x = mpa)) +
    geom_boxplot(width = 0.2, outliers = FALSE) +
    geom_jitter(width = 0.3, color = "coral", alpha = 0.5) +
    xlab("MPA Status") +
    ylab("California Spiny Lobster Count") +
    labs(title = "Distribution of California Spiny Lobster Counts by MPA status") +
    theme_minimal()

# Plot 3: Grouped by year
spiny_counts %>% 
    ggplot(aes(x = count, y = factor(year))) +
    geom_violin(trim = TRUE) +
    stat_summary(fun.y = mean, geom = "point", color = "coral", size = 3, alpha = 0.7)  +
    xlab("California Spiny Lobster Count") +
    ylab("Year") +
    labs(title = "Distribution of California Spiny Lobster Counts by Year & Yearly Average Counts") +
    theme_minimal()

# Plot 4: Plot of lobster size
library(ggbeeswarm)

spiny_counts |>
    ggplot(aes(x = mean_size, fill = mpa)) +
    geom_density(alpha = 0.5) +
    scale_fill_manual(values = c("coral", "lightyellow")) +
    xlab("California Spiny Lobster Mean Size (mm)") +
    ylab("Density") +
    labs(title = "Distribution of CA Spiny Lobster Size by MPA Status") +
    theme_minimal() +
    geom_vline(xintercept = mean(spiny_counts$mean_size, na.rm = TRUE), color = "black") +
    geom_label(aes(x = 61.5, 
                  y = 0.065,
                  label = paste("Overall Lobster Mean Size = ", round(mean(spiny_counts$mean_size, na.rm = TRUE), 2), "mm")),
               size = 3,
               show.legend = FALSE)

c. Compare means of the outcome by treatment group. Using the tbl_summary() function from the package gt_summary

# USE: gt_summary::tbl_summary()
 spiny_counts |> 
    dplyr::select(count, mean_size, mpa) |>
    tbl_summary(by = mpa,
                statistic = list(all_continuous() ~ "{mean}")) |>
    modify_caption("**Comparing CA Spiny Lobster mean counts and sizes between MPA and non-MPA sites**")
Comparing CA Spiny Lobster mean counts and sizes between MPA and non-MPA sites
Characteristic MPA
N = 119
1
non_MPA
N = 133
1
count 28 23
mean_size 76 73
    Unknown 12 15
1 Mean

Step 4: OLS regression- building intuition

a. Start with a simple OLS estimator of lobster counts regressed on treatment. Use the function summ() from the jtools package to print the OLS output

b. Interpret the intercept & predictor coefficients in your own words. Use full sentences and write your interpretation of the regression results to be as clear as possible to a non-academic audience. - At the non-MPA reef sites, there are about 23 CA Spiny Lobsters. At MPA designated reef sites, there are about 5 more CA Spiny Lobsters for a total count of 28 lobsters.

# NOTE: We will not evaluate/interpret model fit in this assignment (e.g., R-square)

m1_ols <- lm(count ~ treat,
             data = spiny_counts)

summ(m1_ols, model.fit = FALSE) 
Observations 252
Dependent variable count
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 22.73 3.57 6.36 0.00
treat 5.36 5.20 1.03 0.30
Standard errors: OLS

c. Check the model assumptions using the check_model function from the performance package

d. Explain the results of the 4 diagnostic plots. Why are we getting this result?

check_model(m1_ols,  check = "qq" )

The qq plot above shows us that the residuals are not normally distributed, which violates one of the key assumptions of OLS. This tells us that OLS may not be the best model for our data.

check_model(m1_ols, check = "normality")

This plot above showing the distribution of our residuals shows us, again, that they are not normally distributed, violating a key OLS assumption. In this case, the residuals are closer to a log-normal distribution.

check_model(m1_ols, check = "homogeneity")

Another assumption of OLS is that the residuals have constant variance. In this plot above, we can see that the variance is not constant, violating the OLS assumption.

check_model(m1_ols, check = "pp_check")

In the figure above, we can see that the actual (observed) data is not a good fit to what we would expect from the model. Because these 4 plots show us that key assumptions of OLS are not being met, and that the observed data is not a good match to the model-predicted data, we can say that OLS is not the best model in this case.


Step 5: Fitting GLMs

a. Estimate a Poisson regression model using the glm() function

#HINT1: Incidence Ratio Rate (IRR): Exponentiation of beta returns coefficient which is interpreted as the 'percent change' for a one unit increase in the predictor 

#HINT2: For the second glm() argument `family` use the following specification option `family = poisson(link = "log")`

# Estimate Poisson regression model
m2_pois <- glm(count ~ treat,
               family = poisson(link = "log"),
               data = spiny_counts)

# Print poisson model output
summ(m2_pois, model.fit = FALSE)
Observations 252
Dependent variable count
Type Generalized linear model
Family poisson
Link log
Est. S.E. z val. p
(Intercept) 3.12 0.02 171.74 0.00
treat 0.21 0.03 8.44 0.00
Standard errors: MLE
# Interpret results as percent change
# Use this way to avoid hard-coding? exp(m2_pois$coefficients[2]) - 1
exp(0.21)-1
## [1] 0.2336781

b. Interpret the predictor coefficient in your own words. Use full sentences and write your interpretation of the results to be as clear as possible to a non-academic audience.

c. Explain the statistical concept of dispersion and overdispersion in the context of this model.

d. Compare results with previous model, explain change in the significance of the treatment effect

e. Check the model assumptions. Explain results.

check_model(m2_pois)

f. Conduct tests for over-dispersion & zero-inflation. Explain results.

check_overdispersion(m2_pois)
## # Overdispersion test
## 
##        dispersion ratio =    67.033
##   Pearson's Chi-Squared = 16758.289
##                 p-value =   < 0.001

This test implies the dispersion (variance) is significantly larger than the mean!

check_zeroinflation(m2_pois)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 0
##             Ratio: 0.00

This test implies that the model is predicting less zeros than actually occur in the data (underfitting zeros), which means there is probable zero-inflation. This model may not be the best fit for our data.

g. Fit a negative binomial model using the function glm.nb() from the package MASS and check model diagnostics

h. In 1-2 sentences explain rationale for fitting this GLM model.

Negative binomial regression is a generalization of Poisson regression which loosens the restrictive assumption that the variance is equal to the mean. In the overdispersion test above, we saw that in this case variance is significantly larger than the mean. Therefore, we needed to pick a model does not assume equal variance and mean.

i. Interpret the treatment estimate result in your own words. Compare with results from the previous model.

In this model, there are 23% more lobsters in our treatment (MPA) group than our control (non-MPA) group. This effect is the same that we observed in the previous model.

# NOTE: The `glm.nb()` function does not require a `family` argument

m3_nb <- glm.nb(count ~ treat,
               data = spiny_counts)

# Print negative binomial model output
summ(m3_nb, model.fit = FALSE)
Observations 252
Dependent variable count
Type Generalized linear model
Family Negative Binomial(0.55)
Link log
Est. S.E. z val. p
(Intercept) 3.12 0.12 26.40 0.00
treat 0.21 0.17 1.23 0.22
Standard errors: MLE
# Interpret results as percent change
exp(0.21)-1
## [1] 0.2336781
check_overdispersion(m3_nb)
## # Overdispersion test
## 
##  dispersion ratio = 1.398
##           p-value = 0.088
check_zeroinflation(m3_nb)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 30
##             Ratio: 1.12
check_predictions(m3_nb)

check_model(m3_nb)


Step 6: Compare models

a. Use the export_summ() function from the jtools package to look at the three regression models you fit side-by-side.

c. Write a short paragraph comparing the results. Is the treatment effect robust or stable across the model specifications.

In the OLS model, there are about 5 more lobsters in MPA sites than non-MPA sites, or a 23.6% increase in lobsters in MPA sites. In the Poisson and the negative binomial model, there is a 23.4% increase in lobsters in MPA sites. The treatment effect is stable across the model specifications.

export_summs(m1_ols, m2_pois, m3_nb,
             model.names = c("OLS","Poisson", "NB"),
             statistics = "none")
OLSPoissonNB
(Intercept)22.73 ***3.12 ***3.12 ***
(3.57)   (0.02)   (0.12)   
treat5.36    0.21 ***0.21    
(5.20)   (0.03)   (0.17)   
*** p < 0.001; ** p < 0.01; * p < 0.05.

Step 7: Building intuition - fixed effects

a. Create new df with the year variable converted to a factor

b. Run the following OLS model using glm.nb()

c. Take a look at the regression output. Each coefficient provides a comparison or the difference in means for a specific sub-group in the data. Informally, describe the what the model has estimated at a conceptual level (NOTE: you do not have to interpret coefficients individually)

d. Explain why the main effect for treatment is negative? *Does this result make sense?

ff_counts <- spiny_counts %>% 
    mutate(year=as_factor(year))
    
m5_fixedeffs <- glm.nb(
    count ~
        treat +
        year +
        treat*year,
    data = ff_counts)

summ(m5_fixedeffs, model.fit = FALSE)
Observations 252
Dependent variable count
Type Generalized linear model
Family Negative Binomial(0.8129)
Link log
Est. S.E. z val. p
(Intercept) 2.35 0.26 8.89 0.00
treat -1.72 0.42 -4.12 0.00
year2013 -0.35 0.38 -0.93 0.35
year2014 0.08 0.37 0.21 0.84
year2015 0.86 0.37 2.32 0.02
year2016 0.90 0.37 2.43 0.01
year2017 1.56 0.37 4.25 0.00
year2018 1.04 0.37 2.81 0.00
treat:year2013 1.52 0.57 2.66 0.01
treat:year2014 2.14 0.56 3.80 0.00
treat:year2015 2.12 0.56 3.79 0.00
treat:year2016 1.40 0.56 2.50 0.01
treat:year2017 1.55 0.56 2.77 0.01
treat:year2018 2.62 0.56 4.69 0.00
Standard errors: MLE

e. Look at the model predictions: Use the interact_plot() function from package interactions to plot mean predictions by year and treatment status.

f. Re-evaluate your responses (c) and (d) above.

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor

interact_plot(m5_fixedeffs, pred = year, modx = treat,
              outcome.scale = "response")  # NOTE: y-axis on log-scale

# HINT: Change `outcome.scale` to "response" to convert y-axis to counts

g. Using ggplot() create a plot in same style as the previous interaction plot, but displaying the original scale of the outcome variable (lobster counts). This type of plot is commonly used to show how the treatment effect changes across discrete time points (i.e., panel data).

The plot should have… - year on the x-axis - counts on the y-axis - mpa as the grouping variable

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor

plot_counts <- spiny_counts |>
    group_by(year, mpa) |> 
    summarise(mean_count = mean(count, na.rm = TRUE)) |> 
    mutate(year = as_factor(year)) |>
    ungroup()

plot_counts |>  
    ggplot(aes(x = year, y = mean_count, group = mpa, color = mpa, linetype = mpa)) +
    geom_point(size = 4) +
    geom_line(size = 1) +
    scale_color_manual(values = c("#1D4677", "#A9D1F6" ),
                       labels = c("MPA", "Non-MPA")) +
    scale_linetype_manual(values = c("solid", "longdash"),
                          labels = c("MPA", "Non-MPA")) +
    theme_minimal() +
    labs(title = "Mean lobster counts in MPA vs non-MPA designated reefs from 2012 - 2018",
         color = "MPA designation",
         linetype = "MPA designation") +
    xlab("Year") +
    ylab("Mean Lobster Count")


Step 8: Reconsider causal identification assumptions

  1. Discuss whether you think spillover effects are likely in this research context (see Glossary of terms; https://docs.google.com/document/d/1RIudsVcYhWGpqC-Uftk9UTz3PIq6stVyEpT44EPNgpE/edit?usp=sharing)
  1. Explain why spillover is an issue for the identification of causal effects
  1. How does spillover relate to impact in this research setting?
  1. Discuss the following causal inference assumptions in the context of the MPA treatment effect estimator. Evaluate if each of the assumption are reasonable:

    1. SUTVA: Stable Unit Treatment Value assumption

      • In our case, the SUTVA implies that MPA treatment effects are applied equally to all lobsters within MPAs, and that lobsters in the control groups do not recieve any affect from neighboring MPAs. This assumption would be violated if there is in fact a spillover effect occurring in our experiment, however without direct proof that spillover occurs, I think that the assumption is reasonable.
    2. Excludability assumption

    • The excludability assumption implies that the MPA treatment is the sole causal effect on the outcome. In this case, it is very possible that there are outside environmental factors at play that effect lobster counts, like warming. However, considering that our sites are fairly close together, any environmental effects at play should affect all of our sites, and therefore not change any comparisons between our treatment and control sites. For this reason, I think that excludability is a fair assumption.

EXTRA CREDIT

Use the recent lobster abundance data with observations collected up until 2024 (lobster_sbchannel_24.csv) to run an analysis evaluating the effect of MPA status on lobster counts using the same focal variables.

  1. Create a new script for the analysis on the updated data
  2. Run at least 3 regression models & assess model diagnostics
  3. Compare and contrast results with the analysis from the 2012-2018 data sample (~ 2 paragraphs)